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Abstract 

Stochastic gradient descent is a simple appproach to find the local min- 
ima of a function whose evaluations are corrupted by noise. In this paper, 
mostly motivated by machine learning applications, we develop a procedure 
extending stochastic gradient descent algorithms to the case where the func- 
tion is defined on a Riemannian manifold. We prove that, as in the Euclidian 
case, the descent algorithm converges to a critical point of the cost function. 
The algorithm has numerous potential applications, and we show several 
well-known algorithms can be cast in our versatile geometric framework. 
We also address the gain tuning issue in connection with the tools of the 
recent theory of symmetry-preserving observers. 

1 Introduction 

Stochastic approximation provides a simple approach, of great practical impor- 
tance, to find the local minima of a function whose evaluations are corrupted by 
noise. It has had a long history in optimization and control with numerous ap- 
plications ( [[191 Ul EQl, to cite a few). To set the main ideas on a toy example, 
we briefly mention an ancient procedure to optimize the ballistic trajectory of a 
projectile in a fluctuating wind. Successive gradient corrections (i.e. corrections 
proportional to the distance between the endpoint and the target) are performed 
on the angle at which the projectile is launched. With a decreasing step tending to 
zero, one can reasonably hope the launching angle will converge to a fixed value, 
that moreover optimizes the average cost function. One of the first formal algo- 
rithm of this kind is the Robbins-Monro algorithm [1261 . which dates back to the 
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fifties. It proves that for a smooth loss function C{w) having a unique minimum, 
the algorithm w t+ i = w t — Jtht{w t ), where h t (w t ) is a noisy evaluation of the 
gradient of C at Wt, converges in quadratic mean to the minimum, under specific 
conditions on the sequence 7$. 

Although stochastic gradient meets applications in control, system identifi- 
cation, and filtering theories (for instance a Kalman filter for noisy observations 
of a constant process is a Robbins-Monro algorithm), new challenging applica- 
tions stem from the active machine learning community. The work of L. Bottou, a 
decade ago [|T2l . has popularized the stochastic gradient approach, both to address 
the online learning problem (identification of a constant parameter in real time 
from noisy output measurements) and large-scale learning (with ever-increasing 
data-sets, approximating the loss function can lead to a reduced numerical com- 
plexity). Some recent problems have been strong drivers for the development of 
new estimation methods, such as the one proposed in the present paper, dealing 
with stochastic gradient descent on manifolds. 

The paper is organized as follows. In Section [2] a stochastic gradient descent 
algorithm on Riemannian manifolds is proposed. Particular cases of this general 
procedure have already been used in 112211211 to address topical machine learning 
problems, where the non-linear parameter search space can be identified to a Rie- 
mannian manifold. Those geometric algorithms have proved to favorably compete 
with state of the art methods, but proofs of convergence are still lacking. Indeed, 
in the Euclidian case, almost sure convergence of the parameter to a critical point 
of the gradient of the cost function is well-established under reasonable assump- 
tions (see e.g. lfT2lO . but it has never been proved for non-Euclidian spaces. In this 
paper, a proof of convergence of the general algorithm is proposed. 

In Section [3] we show that several well-known algorithms can be cast in our 
general geometric framework. This stems from the fact that a wide variety of non- 
linear optimization problems can be re-formulated as optimization on manifolds 
HI O. We prove the celebrated Oja flow ll24ll and Amari [3] natural gradient 
(that have been successfully applied in neural networks and signal processing) 
are particular cases of our general procedure. We also connect our approach to 
applications in medical imaging (see e.g. If25l ) and radar processing §6$, that have 
motivated the development a stochastic gradient algorithm recently introduced in 
dUl to find the barycenter of a probability measure on a manifold. A new intrinsic 
position filtering procedure for a class of mechanical Lagrangian systems is also 
introduced. Finally, we discuss an application to non-linear consensus. 

In Section IH we review the problems, and some algorithms, of 11221 . The goal 
is to identify a positive semi-definite matrix (kernel, Mahalanobis distance) from 
noisy measurements. The theoretical convergence results of Section [2] allow to 
complete the work of [22J. 

In Section \5\ the tuning issue is addressed. As the proposed algorithms can 
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have a very non-linear structure, the performances can be very sensitive to the 
choice of step size and change of data units. We propose a robustification proce- 
dure inspired by the recent theory of symmetry -preserving observers ifTOl . 

The contributions of this paper are threefold. The main contribution is to prove 
the convergence of the proposed general stochastic gradient algorithm on Rieman- 
nian manifolds. Then, it is showed the procedure has numerous applications, and 
allows to either rediscover known algorithms, or derive new ones, or opens up for 
new algorithms of potential interest. The applications are mentioned rather briefly, 
as a detailed treatment of each of them would go beyong the scope of this general 
paper. Finally, a general procedure of practical importance is proposed, in order 
to robustify the algorithms with respect to scalings of inputs and parameters. The 
appendix contains a brief differential geometry recap. Preliminary results can be 
found in EH H. 

2 Stochastic gradient on Riemannian manifolds 

2.1 Standard stochastic gradient in W 1 

Let C(w) = M z Q(z,w) = J Q(z,w)dP(z) be a three times differentiable cost 
function, and consider the optimization problem 

mmC(w) (1) 

where w E M. is a minimization parameter belonging to a Riemannian manifold 
Ai, and z is an event which belongs to a measurable space Z, modeled as a 
random variable drawn from an unknown probability law dP on a probablized 
space fi. Q(z, w) is a loss function which can be viewed as an approximation of 
the average cost function C, evaluated under the input z. The stochastic gradient 
algorithm at each step receives an input z t and performs a gradient descent on 
the approximated cost w t +i = w t — 'jtH(z t ,Wt) where in average ¥, z H(z,w) = 
VC(w). As C is not convex in many applications, one can not hope for a much 
better result than almost sure (a.s.) convergence of C(w t ) to some value C^, 
and of VC(w t ) to 0. Such a result holds under a set of standard assumptions, 
summarized in e.g. lfT2l . 

2.2 Limits of the approach: a motivating example 

A topical problem that has attracted a lot of attention in the machine learning com- 
munity over the last years is low-rank matrix estimation (or matrix completion, 
which can be viewed as the matrix counterpart of sparse approximation problems) 
and in particular the collaborative filtering problem: given a matrix W*a containing 
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the preference ratings of users about items (movies, books), the goal is to com- 
pute personalized recommendations of these items. Only a small subset of entries 
E H is known, and there are many ways to complete the matrix. A standard 
approach to overcome this underdetermination, and to filter the noise, is to con- 
strain the state space supposing the tastes of the users are explained by a reduced 
number of criterium (say, r). This yields the following non-linear optimization 
problem 

min V (Wu - Wa) 2 s.t. rankfWO = r 

The matrix being potentially of high dimension (d\ ~ 10 5 , d 2 — 10 6 in the so- 
called Netflix prize problem), a standard method to reduce the computational bur- 
den is to draw random elements of S, and perform gradient descent ignoring the 
other entries. Unfortunately the updated matrix W — 7*Vw(Wy — Wij) 2 does 
not have rank r. Seeking the rank r matrix which approximates it best can be a 
costly operation, especially for very large di,d 2 . A more natural way to enforce 
the rank constraint is to endow the parameter space with a Riemannian metric, 
and to perform a gradient step within the manifold of fixed-rank matrices. In lETTl 
this approach has lead to stochastic gradient algorithms that compete with state of 
the art methods. Yet a convergence proof is still lacking. The convergence results 
below are general. In Section |4] they will be shown to apply to this problem for 
the particular case of W* symmetric positive definite. 



2.3 Proposed stochastic gradient algorithm 

On a Riemannian manifold Ai, we propose to replace the usual update with the 
following intrinsic update 

It 

w t+ i = exp Wt (-j^—jH(z t ,w t )) (2) 

where exp„, is the exponential map at w and H(z, w) is the Riemannian gradient 
of the cost function at w evaluated under the event z (see Appendix). 

In many cases, the exponential map is not easy to compute (a calculus of 
variations problem must be solved), and it is much easier and much faster to use 
a first-order approximation of the exponential, called a retraction. A retraction 
R w (v) : T W M !->■ M maps the tangent space at w to the manifold, and it is such 
that d(R w (tv),exp w (tv)) = 0(t 2 ). It yields the alternative update 

w t+ i = R Wt (-—^-H(z t ,w t )) (3) 
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2.4 Convergence analysis 

Let D{w\,W2) = d 2 (wi, W2) be the squared Riemannian distance. Consider the 
following assumptions, that can be viewed as an extension of the usual ones lfT2~ll : 

1 . Yl it < 00 and J2 it = +00. 

2. There is a point w* e Ai and S > such that the opposite of the gradient 
points towards w* when d(w, w*) becomes large enough i.e. 

inf (exp- 1 K),VCH)<0 

D(w,w*)>S 

3. There is E > S such that the gradient is bounded around w*, i.e. 

Wz \\H(z,w)\\ < A for D(w,w*) < E 

4. Ai is connected, the sectional curvature is lower bounded by k, < and the 
injectivity radius is lower bounded by / > 0. 

5. One can define / : M. >->■ R in © and © as a continuous function that 
satisfies 

/H 2 > max{l,E 2 (||iJ(^«;)|| 2 (l + ^\{^D{w, w*) + ||if(s,tw)||)))} 

Only the last two assumptions are specific of the Riemannian case. They are 
very mild assumptions that encompass the Euclidian case. In this latter case as- 
sumption 5 is usually replaced with the more restrictive condition E z ( || H (z, w) || 2 ) < 
A + £>||u>|| 2 . In fact, our general procedure based on the adaptive step 7t//(itft) 
allows to relax this standard condition, in the Euclidian case also. Assumption 5 
is not very restrictive, as it is to be expected in applications the term that / must 
dominate is a continuous function of the parameter itself. 

Theorem 1 Let M. be a Riemannian manifold. Consider the optimization prob- 
lem CD- Under assumptions 1-5, the algorithm © is such that C{w t ) converges 
a.s. andVC{w t ) — > a.s. 

The following proof builds upon the Euclidian case [12]. Let us explain what 
are the specific difficulties of the non-Euclidian case. Consider the update ©. 
For simplicity's sake let H denote the gradient H(z, w). Then w t , w t+ i,w* is a 
geodetic triangle. In the Euclidian case D(w t+ i,w*) = D(w t ,w*) + 7 2 ||iJ|| 2 + 
7t||iy|| *jD{w t , w*)cos a. As long as H is pointing towards w* i.e. cos a < the 
increase in squared length is less than 7 2 ||if|| 2 (which is a summable sequence 
if 1 1 if || is bounded). This a key property to prove the trajectories are bounded. 
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But on a manifold of negative curvature, the Alexandrov theorem states that even 
if cos a < 0, the increase in the squared distance can be much larger than the 
squared length 7 2 \\H\\ 2 (Pythagoras' theorem no longer holds). The specific term 
f(w) is meant to compensate those curvature effects. The proof is achieved in two 
steps. 

Compactness of trajectories: We prove the trajectories asymptotically remain 
in a compact set, which generally is the most involved part in stochastic approxi- 
mation. Consider update ©. Let 

h t = ma.x(E, D(w t , w*)) 

Let us prove it converges a.s. to E. A second order Taylor expansion yields 

h t+1 -h t < 2-^-(H(z t ,w t ),exp^(w*)) + {-^-f\\H{z u w t )fh ( 4 ) 
fiwt) f{w t ) 

where k\ is an upper bound on the half of the Riemannian hessian of D(-,w*) 
along the geodesic joining w t to w t+ i (see the Appendix). If the sectional curva- 
ture is lower bounded by k < we have, according to a recent result lfT5l 



V w {D{w,w )/2) < ■== < y/\K\D(w,w*) + 1 

tanh(i v /|K|L'(w, w*)) 



All along the geodesic linking w t and w t +i triangular inequality implies \J D{w, w*) < 
y^D(w t , w*) + \\H(z t , w t )\\ as f(w t ) > 1 and t can be assumed to be large enough 
so that 7t < 1. Let F t be the increasing sequence of a-algebras generated by the 
several variables available just before time t: 

F t = {z , ■■■ , zt-i,w , ■■■ , w t , 7q, • • • , 7t} 

As z t is independent fromF*, we have EfC^^y) 2 ^) || | ,F t ] = ( 7 ^ y ) 2 E,(A; 1 ||if(^,«; t )|| 2 ). 

Conditioning © to F t , and using assumption 5: 

E[h t+1 - h t \F t ] < 2j^(VC(w t ),e W -l(w*))+l? (5) 

This allows to conclude this paragraph as in [fT2ll . The first term of the right 
member can be safely removed as either D(w t +i, w*) is smaller than E, or if it is 
larger, either D(w t , to*) > E and the term is negative because of assumption 2, or 
D(w t , w*) < E but in this case for j t sufficiently small assumption 3 implies that 
D(w t ,w*) > S. Finally E(h t+1 - h t \F t ) < 7 t 2 . Thus h t + Y,7 ll is a positive 
supermartingale, hence it converges a.s. Let us prove it necessarily converges to 
E. 



6 



Proposition 1 [Fisk (1965)] Let (X n ) neN be a non-negative stochastic process 
such that ^]™E((X n+1 — X n )l^ Xn+1 -x n )>o}\F n ) < oo then the process is a 
quasi-martingale, i. e. 

oo 

|E[X n+ i — X n |F n ] | < oo a.s. , and X n converges a.s. 

o 

The name stems from the fact that the sum above is exactly zero if the process 
is a martingale. We have £ E((/i t+1 - h t )l {ht+1 ^ ht>0} \F t ) < oo since E(h t+1 - 
h*\F t ) < 7 2 . Thus h t is a quasi-martingale. According to © we have inequality 
"2 Z% jftjTO, exp" 1 («;*)) < £J 7? - - k|F t ] and as is 

a quasi-martingale we have a.s. 

oo oo oo 

*0 to to 

(6) 

Consider a sample trajectory for which h t converges to E' > E. Because of 
assumption 2 for t large enough we have (VC(w t ), exp~*(u>*)) < — e < 0. This 
contradicts © as £^7t = oo, and f(w t ) is upper bounded as the trajectory 
remains in a compact set. 

Algorithm convergence: Almost every trajectory asymptotically enters the ball 
of center w* and radius E and stays inside it. Let us prove that we can work on 
a fixed compact set. Let G n = f] t>n {D(w t , w*) < E}. We have just proved 
P(U G n ) = 1. To prove a.s. convergence on fi, it is thus sufficient to prove 
a.s. convergence on each of those sets. We assume from now on the trajectories 
all belong to the ball of center w* and radius E. As this is a compact set, all 
continuous functions of the parameter can be bounded. In particular jt/ks < 
7t//( w t) < 7t an d the distortion induced by the curvature does not change the 
convergence analysis on the compact. Indeed, as in lfT2l . let 

9t = C{w t ) > 

A Taylor expansion yields 

9t + i ~9t< -2j^-{H(z t ,w t ),VC(w t )) + (j^f\\H(z t ,w t )\\ 2 k 2 ( 7 ) 

where k 2 is an upper bound on the Riemannian Hessian of C in the compact set. 
We have thus E(g t+1 — g t \F t ) < — 2^|| VC(w t )\\ 2 + ^ 2 k 2 as assumption 5 implies 
f{wt) 2 > E z (\\H(z tl w t )\\ 2 ). This proves g t = C(w t ) is a quasi-martingale. Once 
again it implies a.s. J2T j t \\ VC(w t ) || 2 < oo. Thus if VC(to t ) converges it must 
converge to zero. 
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Now let p t = ||VC(wt)|| 2 . Bounding the second derivative of ||VC|| 2 by k 5 , 
along the geodesic linking w t and Wt+i, a Taylor expansion yields p t+1 — p t < 
-2j^(VC(w t ),(V 2 Wt C)H(z t ,w t )} + (j^) 2 \\H(z t ,w t )\\ 2 h, and thus lower 
bounding the Hessian of C on the compact set by — k^ we have M(p t+ i — Pt\F t ) < 
27t || VC(wt) \\ 2 k± + 7t 2 fc5. We just proved the sum of the right member is finite. 
It implies a.s. convergence of p t towards a value p^ which can only be 0. 

Theorem 2 Suppose R w is a twice differentiable retraction with continuous deriva- 
tives on a Riemannian connected manifold Ai. We define in this case r(w) = 
gup ^ d( Rw (sv),e xPw (sv)) . = 1; s < /}_ A ig 0rithm © is such that C(w t ) con- 
verges a.s. and VC{w t ) — >■ a.s. under assumptions 1-4 and assumption 5 is 
replaced by: there exists a continuous function f : M. — > M. such that f{w) 2 > 
max{l, a(w),b(w),c(w)} with a(w) = 2r(w)E z (\\H(z,w)\\ 2 (l + y/D(w,w*) + 

\\H(z,w )\\)),b(w) = (E z (r(™) 2 \\H(z,w)\\*)) 1/2 , c(w) = E z (\\H(z, w) || 2 (1 + 

y/\tf(y/D(w,W*) + \\H(z,w)\\))). 

The proof is based on the proof of Theorem 1 with a few differences that we 
shall explain briefly in the following. Let H denote H(z t , w t ). We have w t +i = 
R Wt (—j^H). Let u>^i = exp Wt (—j^^H). The proof essentially relies on the 
fact that the points w t +i and w^, are close to each other on the manifold. Indeed, 
for t sufficiently large 7 t < I and d(w^ p v w t+1 ) < r(w t )(y^) 2 \\H\\ 2 . 

Let h t+ i = D(w^ p 1 ,w*) and h t = D(w t ,w*). Squaring the triangular in- 
equality d(w t+1 , w*) < d(w t+1 , w%%) + d(w e t ^, w*) we get D(w t+1 , w*) - h t < 
h t+ i - h t + 2d(w t+ i, wt+Oy'ht+i + D(w e t ^ w t+ i) where by triangular inequal- 
ity Wht+i < Vh~i + \\H\\, for sufficiently large t. But for 7 t small enough 
d(wt+i, w t+ i) < r(u; t )(y^) 2 ||if || 2 . The assumption on /implies E(D(w t+1 ,w*) — 
h t \F t ) < E(h t+ i - h t \F t ) +7 2 + 7 t 4 . We still have © and thus © with additional 
summable terms. The confinement results thus still hold. 

© must also be modified: C{w t+l ) - C(w t ) < \C(w t+1 ) - C(w^))\ + 
C(w^i) — C(w t ). The latter term of the right member can be bounded as in © 
and the first one by k 5 r(w t )(j^^) 2 \\H\\ 2 (and thus in average by k^ 2 ) where k 5 
is a bound on the Riemannian gradient of C . Thus C(w t ) is a quasi-martingale 
and lt\\^C(w t ) \\ 2 < oo. One can conclude by the same token on p t = 

I|vck)|| 2 . 

3 Examples 

In this section, we briefly mention several applications that can be cast in our 
versatile geometric framework. The convergence of some well-known algorithms 
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stems from the theorems above. We also derive a new algorithm for intrinsic 
tracking of the configuration of a Lagrangian system, and we show our procedure 
potentially leads to new algorithms in non-linear consensus. The general approach 
is the following: 1- identify the manifold and the cost function involved, 2- endow 
the manifold with a Riemannian metric and a retraction, 3- derive the stochastic 
gradient algorithm 4- analyze the set defined by VC(w) = 0. 

3.1 Subspace tracking 

We propose to first revisit the well-known subspace tracking algorithm of Oja 
[|24l which is a generalization of the power method for computing the dominant 
eigenvector. In several applications, one wants to compute the r principal eigen- 
vectors, i.e. perform principal component analysis (PCA) of a n x n covariance 
matrix A, where r < n. Furthermore, for computational reasons or for adaptive- 
ness, the measurements are supposed to be a stream of n-dimensional data vectors 
zi, • • • Zt, • • • where E(z t zJ) = A (online estimation). The problem boils down to 
estimating an element of the Grassman manifold Gr(r, n) of r-dimensional sub- 
spaces in a n-dimensional ambiant space, which can be identified to the set of 
rank r projectors: 

Gr(r, n) = {P e R" x ™ s.t. P T = P, P 2 = P, Tr (P) = r}. 

Those projectors can be represented by matrices WW T where W belongs to the 
Stiefel manifold St(r, n), i.e., matrices of M. nxr whose columns are orthonormal. 
Define the cost function expressing the error made projecting on a subspace 

C(W) = h& z [\\WW T z- z\\ 2 } 

It is invariant to rotations W h-> WO, O E 0(r). The state-space is therefore the 
set of equivalence classes [W] = {WO s.t. O G 0(r)}. This set is denoted by 
St(r, n)/0(r). It is a quotient representation of the Grassman manifold Gr(r, d). 
This quotient geometry has been well-studied (e.g. [16]). In this representation 
the gradient under the event z is: H(z, W) = (I — WW T )zz T W . The exponential 
map admits a closed form expression in this geometry. However a computation- 
ally much more efficient retraction is given by Rw (^H)=qf{W + jH) where qf() 
extracts the orthogonal factor in the QR decomposition of its argument. It is an 
infinitely many differentiable retraction [|T]|. With this retraction © writes 

W t+1 = qf(W t + 7t (J - W t Wj)z t zJW t ) (8) 

This algorithm (without the "qf ' operator) is known as the Oja's vector field for 
subspace tracking. It has been proved to converge to an invariant subspace of the 
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co variance matrix A for several retractions under the usual assumption (1) on the 
step. With our approach, convergence to an invariant subspace is directly proved 
for any twice differentiable retraction (e.g. with the exponential map, or the qf() 
operator, for which convergence has only been proved for the averaged algorithm, 
but not for the online one to the author's knowledge). Indeed, Theorems 1 and 2 
apply, as the Grassman manifold has non-negative curvature l|29ll , and St(r, n) is 
a compact set, implying continuous functions of W are upper bounded. It proves 
algorithms of the form ©, ©, and in particular ®, converge to a point such that 
VC(W) = 0, i.e. AW = WW T AW. For such points there exists M such that 
AW = WM, proving W is an invariant subspace of A. A local analysis proves 
the dominant eigenspace is the only stable subspace of the averaged algorithm 

nam. 

3.2 Positive matrices geometric mean and filtering 

Positive definite matrices are fundamental tools in many areas of engineering and 
applied mathematics. They represent covariance matrices in statistics and filter- 
ing, kernels and Mahalanobis distances in machine learning, and diffusion tensors 
in medical imaging. Over the recent years, the natural metric of the positive cone 
ofnxn positive definite matrices P n has become a popular matrix distance in ap- 
plications. The Karcher mean of k matrices Pi, • • • , Pk is defined as a minimizer 
of | Yli=i d 2 (W, Pi) where d is the geodesic distance. Its characteristics allow it 
to be referred to as a "geometric mean" IfTTl . Indeed it generalizes the geometric 
mean of two scalars in several ways (in particular the mean of two matrices of Pi 
is the usual scalar geometric mean), a feature that leads to many desirable proper- 
ties. Developping methods to compute it efficiently has been an active domain of 
research, and a closed formula for the mean of more than three matrices is still an 
open problem. 

Theorem 1 provides a numerical procedure to compute the Karcher mean in 
P n . Indeed let the cost be C(W) = J2i=iQ( W > p i) = \ £i=i p i)- The 
algorithm © writes W t +\ = exp Wt (— jt exp^(P t )) where P t is one of the P/s 
drawn randomly with uniform probability. W t moves on the geodesic linking W t 
and P t towards P t with distance j t d(Wt, Pt) (see Appendix). P n is a Hadamard 
manifold ll25Tl . Thus the injectivity radius is infinite, and the Karcher mean W* is 
unique and it is the only critical point of the cost C(W) . Although the assumptions 
above seem to be satisfied, the compactness of trajectories can be proved very 
simply [HI. Indeed if B is a geodesic ball containing the P/s there exists a compact 
set K containing the ball such that if W t E B then W t +i E K, as j t — > 0, and H 
is bounded in K. As a result the parameter can not escape K as either W t+ \ lies 
between W t and P t and is inside K by convexity, or it is beyond P t and thus is 
in K as P t E B. This is why we let f(W t ) = 1. All the assumptions required in 
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the second part (convergence) of the proof of Theorem 1 are satisfied and thus we 
have a.s. convergence. 

The geometric mean has proved useful in several filtering applications. For in- 
stance, in E51 noisy IRM images are processed. Each voxel is a diffusion tensor 
represented by a matrix of P 3 . Filtering (i.e. averaging) with the natural met- 
ric is more robust to outliers (as geometric means versus arithmetic means). In 
the same spirit, filtering radar covariance images with the natural metric of P n 
seems promising [6] . Note also that the more theoretical work [4] has very re- 
cently introduced a stochastic gradient algorithm to compute probability measure 
barycenters, which applies to this problem, and their approach can also be cast in 
our framework. 

3.3 An intrinsic position filter for a class of mechanical systems 

Consider the classical mechanical system with n degrees of freedom described by 
the Lagrangian 

c{qA) = \giM)<i<i j - u (<i) ( 9 ) 

where the generalized position q is an element of the configuration space Ai, 
which is a manifold, and is written in the local coordinates (<f )j=i... n - U : Ai i— >■ M 
is the potential energy. g(q) = (gij(q))i<ij< n is the Riemaniann metric on Ai 
induced by the kinetic energy T (recall C = T — U), and is thus intrinsically 
defined, i.e. it does not depend on any particular choice of coordinates. (21 [8]| 
have used this metric to derive intrinsic observers for Lagrangian systems whose 
position is measured. The (non-intrinsic) problem has also been studied recently 
in@. 

We propose a method to track intrinsically the configuration of a Lagrangian 
system in a steady state, whose position is measured with noise. Suppose at each 
time step t = 1, 2 ■ • • the measured output is z t E Ai. We suppose the mea- 
surement noise is bounded, and centered, where we define the latter properties 
as follows. Let dP be a probability measure on Ai, with support contained in 
a geodesic ball containing q. We assume that z±, Z2, • • • is a sequence of ran- 
dom variables drawn with unknown law dP, and q is the unique minimizer of 
x i — y J d 2 (x, z)dP(z) where d is the Riemannian distance. LetC(w) = ^d 2 (q,w), 
Q(z,w) = 7jd 2 (z,w). H(z t) w t ) = exp"^^) is the gradient of the half squared 
Riemannian distance. Algorithm © writes w t +i = exp Wt (—y^exp~l(z t )). It 
is a first-order intrinsic filter on the manifold Ai. It allows to track the position 
as first-order filters do. Moreover the algorithm and the convergence speed are 
totally intrinsic. 
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We see this filtering problem boils down mathematically to compute the barycen- 
ter of a probability measure. For this problem, © is the same as flU, which proves 
a.s. convergence for a small enough step and an initialization inside the support 
of dP. When a retraction is used, one can thus conjecture Theorem 2 similarly 
holds, but this still needs to be checked for every mechanical system considered. 

3.4 Non-linear gossip algorithm for distributed averaging on 
manifolds 

The problem of computing distributed averages on a network appears in many ap- 
plications, such as multi-agent systems, distributed data fusion, and decentralized 
optimization. A way to compute distributed averages that has gained popularity 
over the last years is the so-called gossip algorithm [13]. It is a randomized pro- 
cedure where at each iteration, a node of, say, an undirected graph, communicates 
with one neighbor. Both nodes set their value equal to the average of their current 
values. The goal is that all nodes reach a common average value as quickly and 
efficiently as possible, with limited computational power, and an arbitrary net- 
work topology. Gossip algorithms are a special case of distributed optimization 
algorithms, where stochastic gradient descent plays a central role [|30l . 

When applied to multi-agent systems, this algorithm allows the agents to reach 
a consensus, i.e. agree on a common quantity. This is the well known consensus 
problem [|23l . for which gossip algorithms provide the computation of the exact 
average on the agent's values, which is a special case of reaching a consensus on 
some intermediate value. When the consensus space is not linear (for instance a 
group of agents wants to agree on a common motion direction, or a group of oscil- 
lators on a common phase) the methods need to be adapted to the non-linearities 
of the problem (see e.g. I128T0 . In particular, a gossip algorithm on the circle has 
been proposed and studied in Il27ll . 

The general stochastic gradient algorithm proposed in this paper provides a 
randomized procedure for some non-linear consensus problems where the goal is 
to reach the Riemannian mean of the agent's values. Indeed, consider the cost 
function C(x\, ■ ■ ■ , x k ) = J2 1<i<j<k d 2 {x^ Xj) on M k . Suppose the communi- 
cation graph is undirected. At each time two agents randomly selected 
with uniform probability. Algorithm © writes Xi exp x .(— j t ex Px 1 ( x j)) an d 
xj <- exp x .{-'y t exp-i{x i )). 

If Ai is a Hadamard manifold, a consensus is reached. Indeed, the trajectories 
remain in a compact set by exactly the same token as in subsection 13.21 Thus 
Theorem 1 applies, and the agents reach a critical point of the averaged cost. On 
such manifolds the Karcher mean of k points Xi, ■ • • , x k is unique (e.g. flU), and it 
is the unique solution of V m (^ K j<k d 2 (w, Xj)) = 0. Thus, as V X1 C = we have 
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^ x 1 (^2i<j< k d 2 (xi,Xj)) — 0, implying x\ is the Karcher mean of 
This proves by a similar token that all agents are the Karcher mean of x±, ■ • ■ ,x k - 
As it is unique, it proves they converge a.s. to a consensus. For instance, this 
procedure could allow to reach a consensus on the set of positive definite matrices 
in a decentralized randomized way. 

The method could also be applied to the more general case of geodesically 
complete manifolds, but unfortunately the theory does not readily allow to prove 
convergence. The circle, the sphere, and the group of rotations, fall into this cate- 
gory. Those manifolds are characterized by the fact that the definition domain of 
every geodesic can be extended to E, i.e. the boundary can not be reached in finite 
time. On such manifolds, we know from the Hopf-Rinow theorem there exists at 
least one minimizing geodesic between two points, and the cut locus (roughly 
speaking the set where a geodesic stops to be minimizing) is of null measure [|25l . 
However, the theory does not readily applies as the approximated cost function 
d 2 (xi, Xj) is not differentiable at the cut locus. One way to overcome this diffi- 
culty in practice, is to ignore the agent Xj when it belongs to the cut locus at Xi, and 
draw another neighbor. The intuition suggests this should not prevent convergence 
as the cut locus at a point is of null measure in the state space. Besides, igoring Xj 
is equivalent to applying the algorithm with V Xi d 2 (xi, Xj) = V Xj d 2 (xi, Xj) = 0, 
which seems to be consistent with the fact that d 2 (xi,Xj) reaches a maximum 
at this point. However, a proper convergence proof for this case involves non- 
differentiability of the cost, and, as pointed out in the conclusion, it remains an 
open question. 

3.5 Links with information geometry 

An important concept in information geometry is the natural gradient. Let us 
show it is related to the method proposed in this paper. Suppose now that z t are 
realizations of a parametric model with parameter w E W 1 and joint probability 
density function p(z, w). Now let 

Q(z, w) = l(z, w) = \og(p(z, w)) 

be the log-likelihood of the parametric law p. If w is an estimator of the true 
parameter w* based on k realizations of the process z\, ■ ■ ■ ,z k the covariance 
matrix is larger than the Cramer-Rao bound: 

E[(w - w*){w - w*) T ] > ]-G{w*)- 1 

with G the Fisher information matrix G{w) = -E z [(y^l(z,w))(y^,l(z,w)) T ] 
where V s denotes the conventional gradient in Euclidian spaces. As G(w) is a 
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positive definite matrix it defines a Riemannian structure on the state space M. = 
R™, known as the Fisher information metric. In this chart the Riemannian gradient 
of Q(z,w) writes G^ 1 (w) Vf l{z, w) (see Appendix). As M. = R n , a simple 
retraction is the addition R w (u) = w + u. Taking j t = 1/t which is compatible 
with assumption 1, update © writes w t+ \ = w t — \G~ l (w t ) Vf l(z u w t ). This is 
the celebrated Amari's natural gradient (3]|. 

Assuming w t converges to the true parameter w*, Amari proves it is an effi- 
cient estimator. Indeed, letting V t = E[(w t — w*)(w t — w*) T ] we have 

V t+1 = V t - 2EljG- 1 V*l(z t ,w t )(w t - w*) T ) + j 2 G- l GG- 1 + O(i) 

But up to second order terms V%l(z t , w t ) = V%l(z t , w*) + (V^) 2 /(^, w*)(w t - 
w*), withE[V^Z(z, w*)] = as w* achieves a minimum and G(w) = E[(Vf) 2 /(z<, 
because of the basic properties of the Cramer Rao bound. Finally V t+ i — V t — 
2V t /t + G _1 /t 2 , up to terms whose average can be neglected as the Cramer Rao 
bound ensures E(\\w t — w*\ | 2 ) is at least 0(l/t). The asymptotic solution of this 
equation is V t = G^ 1 jt + 0(l/i 2 ) proving statistical efficiency. 

It completes our convergence result, proving that when the space is endowed 
with the Fisher metric, and the trivial retraction is used, the stochastic gradient 
method proposed in this paper provides an efficient estimator. The natural gradi- 
ent has been applied to blind source separation (BSS) and has proved to lead to 
substantial performance gains. 

[29] has recently derived an intrinsic Cramer- Rao bound. The bound does 
not depend on any non-trivial choice of coordinates, i.e. the estimation error 
\w — 1X7* || is replaced with the Riemannian distance associated to the Fisher in- 
formation metric. In the same way, the usual natural gradient update w t+ i = 
w t — \G~ l W^ l(z t , w t ) could be replaced with its intrinsic version © proposed 
in this paper. It can be conjectured this estimator achieves Fisher efficiency i.e. 
reaches the intrinsic Cramer-Rao bound as defined in ll29l . Such a result in the 
theory of information geometry goes beyond the scope of this paper and is left 
for future research. Finally, it is interesting to note that, a stochastic gradient de- 
scent algorithm based on a specific Riemannian metric (the so-called preferential 
metric) has also appeared in [|20ll where the authors propose a generalization of 
the exponentiated gradient method, and also establish a link with Amari's natural 
gradient. 
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4 A detailed application: identification of fixed rank 
matrices 



To illustrate the benefits of the approach on a recent non-linear problem, we focus 
in this section on the most simple algorithm of Il22l . and we prove new conver- 
gence results. Note that, Theorems 1 and 2 and the procedure of Section [5] are 
potentially applicable to the others (more complex) algorithms of ff2~2~ll2~TTl . 

Least Mean Squares (LMS) filters have been extensively utilized in adaptive 
filtering, for online regression. Let x t E M. n be the input vector, and y t be the 
output defined by y t = w T x t + v t where the unknown vector w E M. n is to be 
identified (filter weights), and v t is a noise. At each step we let z t = (x t ,yt) 
and the approximated cost function is Q(z t) w t ) = \(w T x t — yt) 2 - Applying the 
steepest descent leads to the following stochastic gradient algorithm known as 
LMS: w t+x =w t - -f t (w t T x t - y t )x t . 

We now consider a non-linear generalization of this problem coming from the 
machine learning field (see e.g. [31]), where x t E W 1 is the input, y t E R is the 
output, and the matrix counterpart of the linear noisy model is 

y t = Tr (Wx t xJ) +u t = xJWx t + v t (10) 

where the unknown matrix W E M nxn is to be identified and v t is a noise. W 
is assumed to belong to a submanifold of the cone of positive definite matrices. 
In data mining, positive semi-definite matrices W represent a kernel or a Maha- 
lanobis distance, i.e. W{j is the distance or the scalar product between instances 
i and j. We assume at each step an expert provides as a random output of the 
problem with noise. The goal is to filter the noise and estimate the matrix 
W. From a control perspective, it can be viewed as a non-linear matrix filter- 
ing problem. We will apply our stochastic gradient method to the cost function 
C(W) = E z (Q(z, W)) where Q(z t , W t ) = \{x T t W t x t - y t f = \{y t - y t f and 
zt = {xt,Vt)- 

4.1 Estimating a positive definite matrix 

To endow P n with a metric we depart from the factorization W = GG T , G E 
GL(n), where GL(n) is the set of all invertible n x n matrices. Because the 
factorization is invariant by rotation, G n- GO, O E 0(n), the search space 
is identified to the quotient P n ~ GL(n)/0(n),. which represents the set of 
equivalence classes 

[G] = {GO s.t. O E 0(n)}. 

The Euclidian metric <fe(Ai, A 2 ) = Tr (Af A 2 ), for Ai, A 2 E R nxn tangent vec- 
tors at G in GL(n), is invariant along the equivalence classes. As a consequence, 
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it induces a well-defined metric g\q\ (£, Q on the quotient, i.e. for £, £ tangent vec- 
tors at [G] in P n . Classically [1J, the tangent vectors of the quotient space P n are 
identified to the projection on the horizontal space (the orthogonal space to [G]) 
of tangent vectors of the total space GL(n). So tangent vectors at [G] are repre- 
sented by the set of horizontal tangent vectors {Sym(A)G, A e R nx "}, where 
Sym(A) = (A + A T )/2. The horizontal gradient of Q(z t , G t ) is the unique hor- 
izontal vector H(z t , G t ) that satisfies the definition of the Riemannian gradient. 
Elementary computations yield H(z t , G t ) = 2(y t — y t )x t Xt T Gt, and © writes 

G t+1 = G t - -^-(\\Gjx t \\ 2 - y t )x t xjG t (11) 

where we choose f(G t ) = max(l, ||GJ 2 ). Theorem 1 implies the following 
sufficient condition for convergence: 

Proposition 2 Let (x t )t>o be a sequence of centered bounded vectors with sta- 
tistically independent identically distributed coordinates, with moments satisfying 
E(xf) > (E(x 2 )) 2 > 0. Let y t = xJWx t be generated by some positive definite 
unknown matrix W. The Riemannian gradient descent (fTTT) . under assumption 
1, is such that W t = G t Gj converges a.s. to a matrix W^. IfWoo is full rank, 
W OQ = W. 

The proof is as follows. First of all (fTTT) does not strictly satisfy assumption 5 but 
satisfies instead \\H(z t , G t )/ f(G t )\\ 2 < A + P||Gi|| 2 . We could of course have 
chosen to set f(G t ) = max(l, ||Gt|| 3 ) which is consistent with assumption 5, but 
the reasons for this slightly different choice will clearly appear in the next section. 
Here it is easier to prove the compactness of the trajectories embedding the param- 
eter in the Euclidian space, and referring to the proof in the Euclidian case (which 
will allow to prove compactness of the trajectories thanks to the specific term 
f(G t ) introduced in this paper). Indeed this avoids to compute the sectional cur- 
vature of the manifold at each point. The proof in the manifold case will then al- 
low to conclude. Consider the linear (matrix) map U : M i-> E(Tr (xx T M) xx T ). 
E(xV) = for i ^ j, and if we let E((a; i ) 2 ) = a > and E((x*) 4 ) = b > a 2 , 
then U(M) is the matrix whose coordinates are a 2 (Mij + Mji) for i j, and 
Tr (M) a 2 + M u {b - a 2 ) for i = j. 

First, we can prove that for ||G|| 2 sufficiently large, Tr ([E(Tr (xx T {GG T - W)) xx T G}G T )) > 
e > 0, which means the gradient tends to make the norm of the parameter decrease 
in average, when it is too far from the origin. Indeed let P = GG T . We want to 
prove that Tr (U(P)P) > Tr (U(W)P) + e for sufficiently large ||G|| 2 = Tr (P). 
If we choose a basis in which P is diagonal we have Tr (U(P)P) = a 2 Tr (P) 2 + 
(b - a 2 )Tr (P 2 ) = a 2 A;) 2 + (b - a 2 )(J] A 2 ) where A 1; . . . , A n are the eigen- 
values of P. We have also Tr (U(W)P) = a 2 Tr (P) Tr (W) + (b- a 2 ) ^(A^). 
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For sufficiently large Tr (P), Tr (P) 2 can obviously be made arbitrary larger than 
Tr (P) Tr (W), and as (J2 A 2 ) 1/2 > \ £ \ from Cauchy-Schwarz inequality, for 
sufficiently large Tr (P) we have £ A 2 > (£(A 2 ) E(W 2 )) 1/2 > £(W«). 

Letting /i 4 = max(P, ||GJ 2 ) we can go all over the proof of Theorem 1 to 
prove compactness, but as in the Euclidian case i.e. the distance is replaced by the 
Euclidian norm. As we have just proved Tr f[E(Tr {xx T {GG T — W)) xx T G]G T )) > 
e for ||G|| 2 sufficiently large, we end up with the inequality E[/i t+1 — h t \F t ] < 
1t(A + Bht) as \\H(z u G t )/f(G t )\\ 2 < A + B\\G t \\ 2 . Using the same trick as in 
lfT2l , we have E [h t+ i — ( 1 — 7 2 ) h t \ F t ] < and it implies the positive variations 
of the auxiliary sequence fi t h t is bounded, where 1/ fi t — (1 — 7 2 P) • • ■ (1 — 7 2 P). 
As fji t converges this implies a.s. convergence of h t by the quasi-martingale theo- 
rem. This proves the trajectories remain in a compact set. 

Applying Theorem 1 (the end of the proof), we see W t = G t Gf converges to a 
critical point of the average cost function i.e. E x Tr (xx T (G O0 G^ o — W)) xx T Goo=Q. 
If Goo is full rank the matrix U (M)Goo = implies the symmetric matrix U(M) = 
and thus M = i.e. G M G^ = W. 

4.2 Estimating a low-rank positive semidefinite matrix 




(a) (b) (c) 

Figure 1: (a): \\G t Gf — W\\2 versus number of iterations for algorithm (fT2l) with 
model (flOl) and noise turned off v — 0. (b) y t — yt versus number of iterations for 
algorithm (fT2l) with model (TTOl) and noise turned off v = 0. (c) Same as (a) with 
model (flOl) corrupted by a Gaussian white noise of amplitude 10% of the mean 
value of y. 

Because matrix algorithms tend to be applied to computational problems of 
ever-increasing size, they need to be adapted to remain tractable, and the matrices 
dimensions need to be reduced so that the matrices are storable. A wide-spread 
remedy is to work with low-rank approximations. Any rank r approximation of a 
positive definite matrix can be factored as A = GG T where G E R nxr . It is then of 
much reduced size if r n, leading to a reduction of the numerical cost of typical 
matrix operations from 0(n 3 ) to 0(nr 2 ), i.e. linear complexity. This fact has 
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motivated the development of low-rank kernel and Mahalanobis distance learning 
[fT8ll . and geometric understanding of the set of semidefinite positive matrices of 
fixed rank: 

S + (r,n) = {W G M. nxn s.t. W = W T >z 0,iank(W) = r}. 

The generalization of the previous results to S+(r,d) is a straightforward con- 
sequence of the factorization W = GG T where G G R" xr has rank r. The 
quotient geometry of P n ~ GL(n)/0(n) is generalized to the quotient geometry 
of S+(r, d) ~ R" xr /C(r) in a straightforward way, yielding the update: 

G t+1 = G t - -±-(\\G T x t \\ 2 - y t )x t xjG t (12) 

where f(G t ) = max(l, \\G t || 2 ). This non-linear algorithm automatically enforces 
the rank and positive semi-definiteness constraints. Unfortunately in the low-rank 
case the critical points of the cost define a larger set i.e. one can assert = W 
only on the subspace spanned by the columns of Goo. However in numerical 
experiments convergence is always achieved. 

In the numerical experiments illustrated by Fig{TJ we generate input vectors 
x t G M 50 that are Gaussian vectors with mean and identity covariance matrix. 
Each coordinate of x t is set equal to zero when it is above a high predefined value 
so that the input sequence is bounded and satisfies assumptions of Proposition [21 
The output is generated via model (flOl) where W G M 50x50 has rank r = 9. 

5 Gain tuning issues 

We first propose a gain design procedure for the problem of Section HI that can in 
fact be applied to a large class of matrix identification algorithms. We also discuss 
assumptions 1 and 5. 

5.1 A symmetry-preserving procedure 

A practically important limitation of the standard LMS algorithm is that it is sensi- 
tive to the scaling of the input x t . In practice it is hard to find a gain that guarantees 
stability of the algorithm, and the usual standardization procedure (transform the 
data so that each variable has zero mean and unit variance) is not adapted to on- 
line (real-time) procedures. The Normalized least mean squares filter (NLMS) is 
a variant of the LMS algorithm that eliminates the problem normalizing with a 
power of the input i.e. w t+1 = w t - ^t{w t T x t - yt)^rr- 

The analog of a NLMS algorithm for the problem of Section |4] is highly desir- 
able: in classification applications, the algorithm must cope with a wide variety of 
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heterogeneous data, that sometimes even do not have physical units (binary data 
etc.). Here it is achieved inspiring from the recent theory of invariant observers 
[[Toll , as the key property is invariance to scalings of inputs, and parameters. For 
fixed a, j3 > 0, consider the transformations (group action) 

g s (W,x) = (e as W,e^ s x) 

that correspond to changes of units on the input and the parameter. The output 
y t = h(Wt, x t ) = xJWx t is a quasi-homogeneous function (with weights a, (3) of 
degree \x = a + 2(3 since h(g s (W, x)) = e^ s h(W, x). Let W be the true parameter, 
%jt = h(W, x t ) and let y t = h(W t ,x t ). If one writes © in the form W t +i = 
f (Wt, Xf> Vt, Vti It) , one would like the gain tuning to be insensitive to changes of 
units, i.e. 

{e as W t+1 ) = f(g s (W t ,x t ),h(g s (W t ,x t )),e» s h(W,x t ), lt ). (13) 

In other words / should be quasi-homogeneous of degree a. Essentially a nat- 
ural method to "invariantize" the algorithm is to normalize y with y (i.e. build 
a so-called invariant output error), normalize the gradient with powers of x t and 
W t so that it is quasi-homogeneous of degree a (i.e. build scalar invariants and 
an invariant frame), see IfTOl . So in the present case the theory would yield for 
example the following invariantized system 

GWi = G t - lt \og(\\G T x t \\ 2 /y t )f^G t (14) 

\\Xt || 

with the same performances as (fT2l) in practice, but easier to tune. However, as- 
sumptions 1-5 must be checked after invariantization for convergence guarantees. 



5.2 Timing j t in practice 

Assumption 1 is standard in stochastic approximation. As in the standard filter- 
ing problem, or Kalman filter theory, the more observations of a noisy constant 
process one gets, the weaker the gain of the filter becomes. If the gain remains 
high, the noise will make the estimator oscillate around the solution. However, in 
practice too low a gain leads to slow convergence. It is generally recommended to 
set 

74 = bTTt^ (15) 

where in theory we must have e > 0, but in practice e = is a good choice. 

In a similar way, if f(w t ) is too large, the gain can be diminished and the con- 
vergence can be slow. In practice, when implemented in coordinates, algorithms 
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that scale as the parameter (see previous subsection and the algorithm of Section 
HI) are easy to tune and well-behaved. As the lower bound on / is far from tight, it 
could be tempting to relax assumption 5 in practice. However, it must be checked 
in this case (at least) experimentally the trajectories remain bounded. 

6 Conclusion 

In this paper we proposed a stochastic gradient algorihm on Riemannian mani- 
folds. Under reasonable assumptions the convergence of the algorithm was proved. 
Moreover the convergence results are proved to hold when a retraction is used, a 
feature of great practical interest. The approach is versatile, and potentially ap- 
plies to numerous non-linear problems in several fields of research such as control, 
machine learning, and signal processing. From the examples, we see the manifold 
approach is often used either to enforce a constraint, or to derive an intrinsic algo- 
rithm. 

The theoretical results of the paper provide a framework to prove the conver- 
gence of the proposed general algorithm. It is notably established how the step 
must take into account the distortion of the parameter space i.e. the curvature of 
the manifold. It was proved this distortion only plays a role in the compactness 
of trajectories proof. In practice, computing the sectional curvature can be dif- 
ficult enough and it is often more convenient to prove the trajectories remain in 
a compact by other means (by convexity arguments for averaging problems, or 
embedding the manifold in an Euclidian space and referring to the usual proof, 
for matrix identification problems). Once the trajectories are proved to remain in 
a compact set, the proof of Theorem 1 indicates only the regularity assumption is 
required to achieve a.s. convergence. 

Possible theoretical improvements include 1) extending the results of flU to 
the retraction case 2) proving the proposed stochastic gradient is Fisher efficient 
for the intrinsic Cramer-Rao bound [|29l , hence extending the results of 3) 
adapting more advanced results of the literature on stochastic approximation to 
the Riemannian case, that is to say: speed of convergence, convergence in mean, 
case of a strongly convex cost, non-differentiability of the cost. 

In the future we would like to explore two of the aforementioned applications. 
First, the matrix completion problem. Proving the convergence of the stochastic 
gradient algorithm lETTl requires to study the critical points of the averaged cost 
function. This leads to prove mathematically involved results on low-rank matrix 
identifiability, extending the highly non-trivial work of [fl4l . Then, we would like 
to prove general results for non-linear consensus on complete manifolds with a 
stochastic communication graph. In particular we hope to extend the convergence 
bounds of the gossip algorithms in the Euclidian case lfT3l to problems such as the 
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ones described in |[28l|27]|. As pointed out in Section [31 this implies to study the 
algorithm behavior under non-differentiabiliy of the cost function. 
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Appendix: Riemannian geometry background 

Let (A4, g) be a connected Riemannian manifold (see e.g. [QQ| for basic Rieman- 
nian geometry definitions). It carries the structure of a metric space whose dis- 
tance function is the arclength of a minimizing path between two points. The 
length of a curve c(t) 6 Ai being defined by 



If y is sufficiently close to i G M., there is a unique path of minimal length 
linking x and y. It is called a geodesic. The exponential map is defined as follows: 
exp x (v) is the point z E M situated on the geodesic with initial position- velocity 
(x,v) at distance \\v\\ of x. We also define exp^z) = v. The cut locus of x 
is roughly speaking the set where the geodesies starting at x stop being paths of 
minimal length (for example ir on the circle for x = 0). The least distance to the 
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cut locus is the so-called injectivity radius I at x. A geodesic ball is a ball with 
radius less than the injectivity radius at its center. 

For / : Ai — > R twice differentiable, one can define the Riemannian gradi- 
ent as the tangent vector at x satisfying ^|t=o/(exp x (ti;)) = {v,Vf(x)) g and 
the hessian as the operator such that ^|t=o(V/(exp x (to)), Vf(exp x (tv))) g = 
2(V/(x), iy 2 x f)v) g . For instance, if f(x) = \d 2 (p,x) is half the squared dis- 
tance to a point p the Riemannian gradient is V x / = — exp" 1 ^), i.e. it is a 
tangent vector at x collinear to the geodesic linking x and p, with norm d(p, x). 
Letting c(t) = exp x (tv) we have 

f(c(t)) = f(x) + t(Vv, f(x)) g + j\t - 8)(±c(s), (Vl {s) f)±c(s)) g ds. 

and thus f(exp x (tv)) — f(x) < t(Vv, f(x)) g + ^\\v \\ 2 g k, where k is abound on 
the hessian along the geodesic. 

In a given chart (i.e. a local coordinate system), if G(x) is a positive defi- 
nite matrix representing the metric at x we have (V x /, v) g — v T G(x)V x f. The 
exponential map satisfies in any chart exp x (tv) = x + tv + 0(t 2 ). But we 
have f(exp x (tv)) = f(x) + tv T V x f + 0(t 2 ) where V E denotes the conven- 
tional gradient of /. Thus the Riemannian gradient can be expressed in a chart as 
V,/ = Clr) 'V.!:/. 
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